Library Imports
from pyspark.sql import SparkSession
from pyspark.sql import types as T
Template
spark = (
SparkSession.builder
.master("local")
.appName("Exploring Joins")
.config("spark.some.config.option", "some-value")
.getOrCreate()
)
sc = spark.sparkContext
Create a DataFrame
schema = T.StructType([
T.StructField("pet_id", T.IntegerType(), False),
T.StructField("name", T.StringType(), True),
T.StructField("age", T.IntegerType(), True),
])
data = [
(1, "Bear", 13),
(2, "Chewie", 12),
(2, "Roger", 1),
]
pet_df = spark.createDataFrame(
data=data,
schema=schema
)
pet_df.toPandas()
|   | pet_id | name | age |
|---|---|---|---|
| 0 | 1 | Bear | 13 |
| 1 | 2 | Chewie | 12 |
| 2 | 2 | Roger | 1 |
Background
There are 3 datatypes in Spark: RDD, DataFrame and Dataset. As mentioned before, we will focus on the DataFrame datatype.
- The DataFrame is the most performant and most commonly used datatype.
- RDDs are a thing of the past, and you should refrain from using them unless you can't express the transformation with `DataFrame`s.
- `Dataset`s are a thing in Spark Scala (and Java); they are not available in PySpark.
If you have used a DataFrame in Pandas, this is the same idea. If you haven't, a DataFrame is similar to a CSV or Excel file: there are columns and rows that you can perform transformations on. You can search online for better descriptions of what a DataFrame is.
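To make the relationship concrete, here is a minimal sketch (using the `pet_df`, `spark` and `schema` objects created above; `pet_rdd` is just an illustrative name, and the commented output is what you should typically see) showing that every DataFrame is backed by an RDD of `Row` objects:

```python
# Every DataFrame is built on top of an RDD of Row objects;
# you can drop down to it, but the DataFrame API is preferred.
pet_rdd = pet_df.rdd
print(pet_rdd.take(1))
# [Row(pet_id=1, name='Bear', age=13)]

# You can also go back the other way, reusing the schema defined above.
print(spark.createDataFrame(pet_rdd, schema=schema).count())
# 3
```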
What Happened?
For any DataFrame (`df`) that you work with in Spark, you should provide it with 2 things:
- a `schema` for the data. Providing a `schema` explicitly makes it clearer to the reader and sometimes even more performant, if we can tell Spark whether a column is `nullable`. This means providing 3 things (see the sketch after this list):
    - the `name` of the column
    - the `datatype` of the column
    - the `nullability` of the column
- the data. Normally you would read data stored in `gcs`, `aws`, etc. and store it in a `df`, but there will be off-times when you will need to create one yourself.
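To make the schema point concrete, here is a minimal sketch contrasting the explicit `schema` above with letting Spark infer one from `data` (`inferred_df` is just an illustrative name; the commented output is roughly what you should see, and may vary slightly across Spark versions):

```python
# With an explicit schema, the column names, datatypes and nullability
# are exactly what we declared above.
pet_df.printSchema()
# root
#  |-- pet_id: integer (nullable = false)
#  |-- name: string (nullable = true)
#  |-- age: integer (nullable = true)

# Without a schema, Spark infers one from the data: the columns get
# generic names (_1, _2, _3), the integers widen to longs, and every
# column is marked nullable.
inferred_df = spark.createDataFrame(data)
inferred_df.printSchema()
# root
#  |-- _1: long (nullable = true)
#  |-- _2: string (nullable = true)
#  |-- _3: long (nullable = true)
```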